The SARTools R package which generated this report has been developped at PF2 - Institut Pasteur by M.-A. Dillies and H. Varet (). Thanks to cite H. Varet, L. Brillet-Guéguen, J.-Y. Coppee and M.-A. Dillies, SARTools: A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data, PLoS One, 2016, doi: http://dx.doi.org/10.1371/journal.pone.0157022 when using this tool for any analysis published.

1 Introduction

The analyses reported in this document are part of the SARTools.DESeq2.Kallisto.Transcripts.batchSex.byGroup project. The aim is to find features that are differentially expressed between Control, Combination, Cort and Pregnanedione. The statistical analysis process includes data normalization, graphical exploration of raw and normalized data, test for differential expression for each feature between the conditions, raw p-value adjustment and export of lists of features having a significant differential expression between the conditions. In this analysis, the Sex effect will be taken into account in the statistical models.

The analysis is performed using the R software [1], Bioconductor [2] packages including DESeq2 [3,4] and the SARTools package developed at PF2 - Institut Pasteur. Normalization and differential analysis are carried out according to the DESeq2 model and package. This report comes with additional tab-delimited text files that contain lists of differentially expressed features.

For more details about the DESeq2 methodology, please refer to its related publications [3,4].

2 Description of raw data

The count data files and associated biological conditions are listed in the following table.

Table 1: Data files and associated biological conditions.
label file Treatment Egg_mass Group Sex
A11_Mem_12 A11-Mem-12_S35/abundance.tsv Control 65.56 A M
A12_Mem_13 A12-Mem-13_S38/abundance.tsv Control 65.27 B F
A7_Mem_7 A7-Mem-7_S23/abundance.tsv Control 63.21 A M
B11_Mem_27 B11-Mem-27_S36/abundance.tsv Control 70.29 B F
B5_Mem_19 B5-Mem-19_S18/abundance.tsv Control 69.21 A M
B7_Mem_21 B7-Mem-21_S24/abundance.tsv Control 66.20 B F
C10_Mem_42 C10-Mem-42_S34/abundance.tsv Control 67.15 B F
C2_Mem_30 C2-Mem-30_S7/abundance.tsv Control 72.07 B F
C4_Mem_32 C4-Mem-32_S15/abundance.tsv Control 61.62 B F
C9_Mem_40 C9-Mem-40_S31/abundance.tsv Control 63.83 B F
A10_Mem_10 A10-Mem-10_S32/abundance.tsv Combination 62.43 A M
B9_Mem_24 B9-Mem-24_S30/abundance.tsv Combination 64.66 B F
C1_Mem_29 C1-Mem-29_S3/abundance.tsv Combination 65.52 B F
C11_Mem_44 C11-Mem-44_S37/abundance.tsv Combination 68.94 A M
C7_Mem_36 C7-Mem-36_S25/abundance.tsv Combination 73.03 B F
C8_Mem_38 C8-Mem-38_S28/abundance.tsv Combination 63.55 A M
D1_Mem_46 D1-Mem-46_S4/abundance.tsv Combination 66.06 B F
D2_Mem_47 D2-Mem-47_S8/abundance.tsv Combination 67.75 A M
D3_Mem_49 D3-Mem-49_S12/abundance.tsv Combination 71.84 A M
D4_Mem_54 D4-Mem-54_S16/abundance.tsv Combination 70.20 B F
A3_Mem_3 A3-Mem-3_S9/abundance.tsv Cort 64.25 A M
B1_Mem_14 B1-Mem-14_S2/abundance.tsv Cort 62.92 A M
B10_Mem_25 B10-Mem-25_S33/abundance.tsv Cort 69.28 B F
B2_Mem_15 B2-Mem-15_S6/abundance.tsv Cort 60.19 B F
B3_Mem_16 B3-Mem-16_S10/abundance.tsv Cort 70.99 A M
B4_Mem_18 B4-Mem-18_S14/abundance.tsv Cort 68.25 B F
B8_Mem_22 B8-Mem-22_S27/abundance.tsv Cort 67.17 A M
C12_Mem_45 C12-Mem-45_S40/abundance.tsv Cort 64.93 B F
C3_Mem_31 C3-Mem-31_S11/abundance.tsv Cort 65.32 B F
C6_Mem_34 C6-Mem-34_S22/abundance.tsv Cort 72.14 B F
A1_Mem_1 A1-Mem-1_S1/abundance.tsv Pregnanedione 61.11 B F
A2_Mem_2 A2-Mem-2_S5/abundance.tsv Pregnanedione 72.16 A M
A4_Mem_4 A4-Mem-4_S13/abundance.tsv Pregnanedione 65.38 A M
A5_Mem_5 A5-Mem-5_S17/abundance.tsv Pregnanedione 65.01 B F
A6_Mem_6 A6-Mem-6_S20/abundance.tsv Pregnanedione 66.62 A M
A8_Mem_8 A8-Mem-8_S26/abundance.tsv Pregnanedione 65.80 B F
A9_Mem_9 A9-Mem-9_S29/abundance.tsv Pregnanedione 69.29 B F
B12_Mem_28 B12-Mem-28_S39/abundance.tsv Pregnanedione 68.55 B F
B6_Mem_20 B6-Mem-20_S21/abundance.tsv Pregnanedione 67.41 B F
C5_Mem_33 C5-Mem-33_S19/abundance.tsv Pregnanedione 71.40 B F

After loading the data we first have a look at the raw data table itself. The data table contains one row per annotated feature and one column per sequenced sample. Row names of this table are feature IDs (unique identifiers). The table contains raw count values representing the number of reads that map onto the features. For this project, there are 28756 features in the count data table.

Table 2: Partial view of the count data table.
A11_Mem_12 A12_Mem_13 A7_Mem_7 B11_Mem_27 B5_Mem_19 B7_Mem_21 C10_Mem_42 C2_Mem_30 C4_Mem_32 C9_Mem_40 A10_Mem_10 B9_Mem_24 C1_Mem_29 C11_Mem_44 C7_Mem_36 C8_Mem_38 D1_Mem_46 D2_Mem_47 D3_Mem_49 D4_Mem_54 A3_Mem_3 B1_Mem_14 B10_Mem_25 B2_Mem_15 B3_Mem_16 B4_Mem_18 B8_Mem_22 C12_Mem_45 C3_Mem_31 C6_Mem_34 A1_Mem_1 A2_Mem_2 A4_Mem_4 A5_Mem_5 A6_Mem_6 A8_Mem_8 A9_Mem_9 B12_Mem_28 B6_Mem_20 C5_Mem_33
ENSGALT00000000003.5 2 0 2 5 10 3 5 5 2 4 2 4 12 5 3 3 4 2 5 2 3 3 4 4 2 3 4 2 4 5 5 5 8 9 7 4 4 2 7 10
ENSGALT00000000012.7 55 61 58 55 99 79 55 67 68 79 57 60 81 57 78 75 74 52 55 72 70 69 60 76 51 70 75 67 72 89 74 56 70 50 110 56 45 63 87 66
ENSGALT00000000022.6 250 187 196 222 304 243 250 273 253 242 235 231 257 193 258 283 243 258 228 246 243 305 285 260 246 291 254 216 222 260 259 246 224 241 277 271 199 200 318 250
ENSGALT00000000051.6 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
ENSGALT00000000059.7 2 0 6 3 2 2 3 3 3 4 0 3 0 1 2 1 6 2 3 2 5 7 2 2 0 1 3 3 2 6 4 1 3 0 2 1 4 1 0 0
ENSGALT00000000063.6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Looking at the summary of the count table provides a basic description of these raw counts (min and max values, median, etc).

Table 3: Summary of the raw counts.
A11_Mem_12 A12_Mem_13 A7_Mem_7 B11_Mem_27 B5_Mem_19 B7_Mem_21 C10_Mem_42 C2_Mem_30 C4_Mem_32 C9_Mem_40 A10_Mem_10 B9_Mem_24 C1_Mem_29 C11_Mem_44 C7_Mem_36 C8_Mem_38 D1_Mem_46 D2_Mem_47 D3_Mem_49 D4_Mem_54 A3_Mem_3 B1_Mem_14 B10_Mem_25 B2_Mem_15 B3_Mem_16 B4_Mem_18 B8_Mem_22 C12_Mem_45 C3_Mem_31 C6_Mem_34 A1_Mem_1 A2_Mem_2 A4_Mem_4 A5_Mem_5 A6_Mem_6 A8_Mem_8 A9_Mem_9 B12_Mem_28 B6_Mem_20 C5_Mem_33
Min. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1st Qu. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Median 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 2 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 1 1 3 2
Mean 111 99 106 86 137 102 111 122 124 114 101 117 113 94 99 116 119 113 112 116 127 121 123 125 114 127 119 104 104 117 104 114 108 116 123 117 94 112 123 122
3rd Qu. 34 33 32 29 45 33 36 40 39 37 33 37 35 30 33 39 33 33 34 37 40 36 39 40 39 42 41 34 33 39 35 33 35 38 43 40 31 34 46 39
Max. 124047 83649 91338 92056 91339 79913 80451 120146 90819 74708 66927 80931 116360 140311 63500 78507 290793 259297 162132 99781 99023 179870 122544 105604 68318 76310 72692 71539 121737 79967 98702 121839 74589 80157 103847 71497 63809 95654 75971 92866

Figure 1 shows the total number of mapped and counted reads for each sample. We expect total read counts to be similar within conditions, they may be different across conditions. Total counts sometimes vary widely between replicates. This may happen for several reasons, including:

  • different rRNA contamination levels between samples (even between biological replicates);
  • slight differences between library concentrations, since they may be difficult to measure with high precision.
Figure 1: Number of mapped reads per sample. Colors refer to the biological condition of the sample.

Figure 1: Number of mapped reads per sample. Colors refer to the biological condition of the sample.

Figure 2 shows the percentage of features with no read count in each sample. We expect this percentage to be similar within conditions. Features with null read counts in the 40 samples are left in the data but are not taken into account for the analysis with DESeq2. Here, 4956 features (17.23%) are in this situation (dashed line). Results for those features (fold-change and p-values) are set to NA in the results files.

Figure 2: Percentage of features with null read counts in each sample.

Figure 2: Percentage of features with null read counts in each sample.

Figure 3 shows the distribution of read counts for each sample. For sake of readability, \(\text{log}_2(\text{counts}+1)\) are used instead of raw counts. Again we expect replicates to have similar distributions. In addition, this figure shows if read counts are preferably low, medium or high. This depends on the organisms as well as the biological conditions under consideration.

Figure 3: Density distribution of read counts.

Figure 3: Density distribution of read counts.

It may happen that one or a few features capture a high proportion of reads (up to 20% or more). This phenomenon should not influence the normalization process. The DESeq2 normalization has proved to be robust to this situation [Dillies, 2012]. Anyway, we expect these high count features to be the same across replicates. They are not necessarily the same across conditions. Figure 4 and table 4 illustrate the possible presence of such high count features in the data set.

Figure 4: Percentage of reads associated with the sequence having the highest count (provided in each box on the graph) for each sample.

Figure 4: Percentage of reads associated with the sequence having the highest count (provided in each box on the graph) for each sample.

Table 4: Percentage of reads associated with the sequences having the highest counts.
A11_Mem_12 A12_Mem_13 A7_Mem_7 B11_Mem_27 B5_Mem_19 B7_Mem_21 C10_Mem_42 C2_Mem_30 C4_Mem_32 C9_Mem_40 A10_Mem_10 B9_Mem_24 C1_Mem_29 C11_Mem_44 C7_Mem_36 C8_Mem_38 D1_Mem_46 D2_Mem_47 D3_Mem_49 D4_Mem_54 A3_Mem_3 B1_Mem_14 B10_Mem_25 B2_Mem_15 B3_Mem_16 B4_Mem_18 B8_Mem_22 C12_Mem_45 C3_Mem_31 C6_Mem_34 A1_Mem_1 A2_Mem_2 A4_Mem_4 A5_Mem_5 A6_Mem_6 A8_Mem_8 A9_Mem_9 B12_Mem_28 B6_Mem_20 C5_Mem_33
ENSGALT00000012060.6 3.89 2.94 2.98 0.82 1.43 2.73 2.53 1.21 2.33 2.14 2.31 2.39 2.33 1.41 2.24 2.35 2.06 2.01 1.36 1.47 1.79 2.24 1.49 1.70 1.19 2.08 2.12 1.99 1.75 2.38 1.78 3.71 2.41 2.22 2.93 1.54 2.22 2.10 2.15 2.64
ENSGALT00000028027.6 3.11 2.25 2.10 0.62 1.19 2.13 1.94 0.96 1.84 1.54 1.87 1.85 1.81 1.05 1.83 1.93 1.74 1.69 1.17 1.10 1.45 1.87 1.16 1.39 1.04 1.76 1.65 1.60 1.50 2.02 1.45 3.18 1.96 1.74 2.44 1.24 1.76 1.72 1.69 2.31
ENSGALT00000068173.2 2.83 2.13 1.84 0.58 1.09 2.03 1.76 0.93 1.78 1.45 1.64 1.68 1.75 1.05 1.61 1.81 1.65 1.72 1.07 1.05 1.39 1.83 1.04 1.27 0.92 1.55 1.51 1.52 1.38 1.80 1.41 2.97 1.76 1.61 2.19 1.17 1.68 1.60 1.50 2.08
ENSGALT00000024434.5 2.50 1.99 2.92 1.65 2.32 2.10 1.76 1.69 2.04 1.88 2.25 2.40 1.22 1.74 1.29 2.13 1.89 1.78 2.68 2.18 2.71 2.32 1.52 1.92 2.01 1.66 1.40 2.06 1.48 2.08 1.47 2.74 1.95 2.40 1.91 2.12 2.37 2.98 1.21 2.38
ENSGALT00000103112.1 1.76 1.22 1.07 3.74 1.54 0.93 2.52 3.42 2.54 2.27 1.51 1.97 3.57 5.21 1.51 2.28 8.50 8.01 5.03 3.00 1.69 5.19 3.47 2.94 2.08 2.03 1.57 2.40 4.07 1.75 3.31 2.83 1.37 1.23 1.16 1.21 1.43 2.53 1.12 2.11
ENSGALT00000040832.4 1.51 1.20 1.11 1.62 1.41 0.95 2.12 2.06 1.83 1.80 1.34 1.66 1.54 2.87 1.33 1.67 2.39 2.49 2.46 2.01 1.32 1.71 1.91 1.75 1.56 1.70 1.36 1.70 2.33 1.46 1.29 1.76 1.33 1.11 1.31 1.11 1.24 1.59 1.08 1.67
ENSGALT00000101725.1 0.93 0.72 0.71 0.99 0.97 0.59 1.42 1.51 1.32 1.25 1.02 1.13 0.82 1.79 0.93 1.08 1.23 1.39 1.60 1.50 0.88 0.86 1.26 1.13 1.20 1.09 0.94 1.16 1.58 0.97 0.70 0.98 0.89 0.73 0.84 0.74 0.82 1.01 0.78 1.00

We may wish to assess the similarity between samples across conditions. A pairwise scatter plot is produced (figure 5) to show how replicates and samples from different biological conditions are similar or different (\(\text{log}_2(\text{counts}+1)\) are used instead of raw count values). Moreover, as the Pearson correlation has been shown not to be relevant to measure the similarity between replicates, the SERE statistic has been proposed as a similarity index between RNA-Seq samples [5]. It measures whether the variability between samples is random Poisson variability or higher. Pairwise SERE values are printed in the lower triangle of the pairwise scatter plot. The value of the SERE statistic is:

  • 0 when samples are identical (no variability at all: this may happen in the case of a sample duplication);

  • 1 for technical replicates (technical variability follows a Poisson distribution);

  • greater than 1 for biological replicates and samples from different biological conditions (biological variability is higher than technical one, data are over-dispersed with respect to Poisson). The higher the SERE value, the lower the similarity. It is expected to be lower between biological replicates than between samples of different biological conditions. Hence, the SERE statistic can be used to detect inversions between samples.

Figure 5: Pairwise comparison of samples (not produced when more than 30 samples).

Figure 5: Pairwise comparison of samples (not produced when more than 30 samples).

3 Variability within the experiment: data exploration

The main variability within the experiment is expected to come from biological differences between the samples. This can be checked in two ways. The first one is to perform a hierarchical clustering of the whole sample set. This is performed after a transformation of the count data which can be either a Variance Stabilizing Transformation (VST) or a regularized log transformation (rlog) [3,4].

A VST is a transformation of the data that makes them homoscedastic, meaning that the variance is then independent of the mean. It is performed in two steps: (i) a mean-variance relationship is estimated from the data with the same function that is used to normalize count data and (ii) from this relationship, a transformation of the data is performed in order to get a dataset in which the variance is independent of the mean. The homoscedasticity is a prerequisite for the use of some data analysis methods, such as hierarchical clustering or Principal Component Analysis (PCA). The regularized log transformation is based on a GLM (Generalized Linear Model) on the counts and has the same goal as a VST but is more robust in the case when the size factors vary widely.

Figure 6 shows the dendrogram obtained from VST-transformed data. An euclidean distance is computed between samples, and the dendrogram is built upon the Ward criterion. We expect this dendrogram to group replicates and separate biological conditions.

Figure 6: Sample clustering based on normalized data.

Figure 6: Sample clustering based on normalized data.

Another way of visualizing the experiment variability is to look at the first principal components of the PCA, as shown on the figure 7. On this figure, the first principal component (PC1) is expected to separate samples from the different biological conditions, meaning that the biological variability is the main source of variance in the data.

Figure 7: First two components of a Principal Component Analysis, with percentages of variance associated with each axis.

Figure 7: First two components of a Principal Component Analysis, with percentages of variance associated with each axis.

For the statistical analysis, we need to take into account the effect of the Sex parameter. Statistical models and tests will thus be adjusted on it.

4 Normalization

Normalization aims at correcting systematic technical biases in the data, in order to make read counts comparable across samples. The normalization proposed by DESeq2 relies on the hypothesis that most features are not differentially expressed. It computes a scaling factor for each sample. Normalized read counts are obtained by dividing raw read counts by the scaling factor associated with the sample they belong to. Scaling factors around 1 mean (almost) no normalization is performed. Scaling factors lower than 1 will produce normalized counts higher than raw ones, and the other way around. Two options are available to compute scaling factors: locfunc=“median” (default) or locfunc=“shorth”. Here, the normalization was performed with locfunc=“median”.

Table 5: Normalization factors.
A11_Mem_12 A12_Mem_13 A7_Mem_7 B11_Mem_27 B5_Mem_19 B7_Mem_21 C10_Mem_42 C2_Mem_30 C4_Mem_32 C9_Mem_40 A10_Mem_10 B9_Mem_24 C1_Mem_29 C11_Mem_44 C7_Mem_36 C8_Mem_38 D1_Mem_46 D2_Mem_47 D3_Mem_49 D4_Mem_54 A3_Mem_3 B1_Mem_14 B10_Mem_25 B2_Mem_15 B3_Mem_16 B4_Mem_18 B8_Mem_22 C12_Mem_45 C3_Mem_31 C6_Mem_34 A1_Mem_1 A2_Mem_2 A4_Mem_4 A5_Mem_5 A6_Mem_6 A8_Mem_8 A9_Mem_9 B12_Mem_28 B6_Mem_20 C5_Mem_33
Size factor 0.94 0.90 0.90 0.79 1.26 0.94 0.99 1.12 1.09 1.05 0.90 1.02 1.04 0.82 0.94 1.06 0.95 0.91 0.94 1.04 1.13 1.02 1.13 1.12 1.08 1.19 1.15 0.94 0.91 1.08 0.98 0.92 0.97 1.06 1.16 1.12 0.86 0.95 1.24 1.09

The histograms (figure 8) can help to validate the choice of the normalization parameter (“median” or “shorth”). Under the hypothesis that most features are not differentially expressed, each size factor represented by a red line is expected to be close to the mode of the distribution of the counts divided by their geometric means across samples.

Figure 8: Diagnostic of the estimation of the size factors.

Figure 8: Diagnostic of the estimation of the size factors.

The figure 9 shows that the scaling factors of DESeq2 and the total count normalization factors may not perform similarly.

Figure 9: Plot of the estimated size factors and the total number of reads per sample.

Figure 9: Plot of the estimated size factors and the total number of reads per sample.

Boxplots are often used as a qualitative measure of the quality of the normalization process, as they show how distributions are globally affected during this process. We expect normalization to stabilize distributions across samples. Figure 10 shows boxplots of raw (left) and normalized (right) data respectively.

Figure 10: Boxplots of raw (left) and normalized (right) read counts.

Figure 10: Boxplots of raw (left) and normalized (right) read counts.

5 Differential analysis

5.1 Modelisation

DESeq2 aims at fitting one linear model per feature. For this project, the design used is counts ~ Sex + Treatment and the goal is to estimate the models’ coefficients which can be interpreted as \(\log_2(\texttt{FC})\). These coefficients will then be tested to get p-values and adjusted p-values.

5.2 Outlier detection

Model outliers are features for which at least one sample seems unrelated to the experimental or study design. For every feature and for every sample, the Cook’s distance [6] reflects how the sample matches the model. A large value of the Cook’s distance indicates an outlier count and p-values are not computed for the corresponding feature.

5.3 Dispersions estimation

The DESeq2 model assumes that the count data follow a negative binomial distribution which is a robust alternative to the Poisson law when data are over-dispersed (the variance is higher than the mean). The first step of the statistical procedure is to estimate the dispersion of the data. Its purpose is to determine the shape of the mean-variance relationship. The default is to apply a GLM (Generalized Linear Model) based method (fitType=“parametric”), which can handle complex designs but may not converge in some cases. The alternative is to use fitType=“local” as described in the original paper [3] or fitType=“mean”. The parameter used for this project is fitType=“parametric”. Then, DESeq2 imposes a Cox Reid-adjusted profile likelihood maximization [7] and uses the maximum a posteriori (MAP) of the dispersion [Wu, 2013].

Figure 11: Dispersion estimates (left) and diagnostic of log-normality (right).

Figure 11: Dispersion estimates (left) and diagnostic of log-normality (right).

The left panel on figure 11 shows the result of the dispersion estimation step. The x- and y-axes represent the mean count value and the estimated dispersion respectively. Black dots represent empirical dispersion estimates for each feature (from the observed counts). The red dots show the mean-variance relationship function (fitted dispersion value) as estimated by the model. The blue dots are the final estimates from the maximum a posteriori and are used to perform the statistical test. Blue circles (if any) point out dispersion outliers. These are features with a very high empirical variance (computed from observed counts). These high dispersion values fall far from the model estimation. For these features, the statistical test is based on the empirical variance in order to be more conservative than with the MAP dispersion. These features will have low chance to be declared significant. The figure on the right panel allows to check the hypothesis of log-normality of the dispersions.

5.4 Statistical test for differential expression

Once the dispersion estimation and the model fitting have been done, DESeq2 can perform the statistical testing. Figure 12 shows the distributions of raw p-values computed by the statistical test for the comparison(s) done. This distribution is expected to be a mixture of a uniform distribution on \([0,1]\) and a peak around 0 corresponding to the differentially expressed features.

Figure 12: Distribution(s) of raw p-values.

Figure 12: Distribution(s) of raw p-values.

5.5 Independent filtering

DESeq2 can perform an independent filtering to increase the detection power of differentially expressed features at the same experiment-wide type I error. Since features with very low counts are not likely to see significant differences typically due to high dispersion, it defines a threshold on the mean of the normalized counts irrespective of the biological condition. This procedure is independent because the information about the variables in the design formula is not used [4].

Table 6 reports the thresholds used for each comparison and the number of features discarded by the independent filtering. Adjusted p-values of discarded features are then set to NA.
Table 6: Number of features discarded by the independent filtering for each comparison.
Test vs Ref Threshold # discarded
Combination vs Control 0.02 4956
Cort vs Control 0.03 5869
Pregnanedione vs Control 0.02 5430
Cort vs Combination 0.02 4956
Pregnanedione vs Combination 0.02 5430
Pregnanedione vs Cort 0.02 4956

5.6 Final results

A p-value adjustment is performed to take into account multiple testing and control the false positive rate to a chosen level \(\alpha\). For this analysis, a BH p-value adjustment was performed [8] and the level of controlled false positive rate was set to 0.05.

Table 7: Number of up-, down- and total number of differentially expressed features for each comparison.
Test vs Ref # down # up # total
Combination vs Control 22 17 39
Cort vs Control 23 26 49
Pregnanedione vs Control 28 10 38
Cort vs Combination 10 17 27
Pregnanedione vs Combination 46 54 100
Pregnanedione vs Cort 61 79 140

Figure 13 represents the MA-plot of the data for the comparisons done, where differentially expressed features are highlighted in red. A MA-plot represents the log ratio of differential expression as a function of the mean intensity for each feature. Triangles correspond to features having a too low/high \(\log_2(\text{FC})\) to be displayed on the plot.

Figure 13: MA-plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Figure 13: MA-plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Figure 14 shows the volcano plots for the comparisons performed and differentially expressed features are still highlighted in red. A volcano plot represents the log of the adjusted P value as a function of the log ratio of differential expression.

Figure 14: Volcano plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Figure 14: Volcano plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Full results as well as lists of differentially expressed features are provided in the following text files which can be easily read in a spreadsheet. For each comparison:

  • TestVsRef.complete.txt contains results for all the features;
  • TestVsRef.up.txt contains results for significantly up-regulated features. Features are ordered from the most significant adjusted p-value to the less significant one;
  • TestVsRef.down.txt contains results for significantly down-regulated features. Features are ordered from the most significant adjusted p-value to the less significant one.

These files contain the following columns:

  • Id: unique feature identifier;
  • sampleName: raw counts per sample;
  • norm.sampleName: rounded normalized counts per sample;
  • baseMean: base mean over all samples;
  • Control, Combination, Cort and Pregnanedione: means (rounded) of normalized counts of the biological conditions;
  • FoldChange: fold change of expression, calculated as \(2^{\log_2(\text{FC})}\);
  • log2FoldChange: \(\log_2(\text{FC})\) as estimated by the GLM model. It reflects the differential expression between Test and Ref and can be interpreted as \(\log_2(\frac{\text{Test}}{\text{Ref}})\). If this value is:
    • around 0: the feature expression is similar in both conditions;
    • positive: the feature is up-regulated (\(\text{Test} > \text{Ref}\));
    • negative: the feature is down-regulated (\(\text{Test} < \text{Ref}\));
  • stat: Wald statistic for the coefficient tested;
  • pvalue: raw p-value from the statistical test;
  • padj: adjusted p-value on which the cut-off \(\alpha\) is applied;
  • dispGeneEst: dispersion parameter estimated from feature counts (i.e. black dots on figure 11);
  • dispFit: dispersion parameter estimated from the model (i.e. red dots on figure 11);
  • dispMAP: dispersion parameter estimated from the Maximum A Posteriori model;
  • dispersion: final dispersion parameter used to perform the test (i.e. blue dots and circles on figure 11);
  • betaConv: convergence of the coefficients of the model (TRUE or FALSE);
  • maxCooks: maximum Cook’s distance of the feature.

6 R session information and parameters

The versions of the R software and Bioconductor packages used for this analysis are listed below. It is important to save them if one wants to re-perform the analysis in the same conditions.

  • R version 3.5.1 (2018-07-02), x86_64-apple-darwin15.6.0
  • Locale: en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
  • Running under: macOS High Sierra 10.13.6
  • Matrix products: default
  • BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
  • LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
  • Base packages: base, datasets, graphics, grDevices, methods, parallel, stats, stats4, utils
  • Other packages: Biobase 2.42.0, BiocGenerics 0.28.0, BiocParallel 1.16.6, DelayedArray 0.8.0, DESeq2 1.22.2, dplyr 0.8.3, edgeR 3.24.3, forcats 0.4.0, genefilter 1.64.0, GenomeInfoDb 1.18.2, GenomicRanges 1.34.0, ggplot2 3.2.0, IRanges 2.16.0, limma 3.38.3, matrixStats 0.54.0, purrr 0.3.2, readr 1.3.1, readxl 1.3.1, S4Vectors 0.20.1, SARTools 1.6.6, stringr 1.4.0, SummarizedExperiment 1.12.0, tibble 2.1.1, tidyr 0.8.3, tidyverse 1.2.1, xtable 1.8-4
  • Loaded via a namespace (and not attached): acepack 1.4.1, annotate 1.60.1, AnnotationDbi 1.44.0, assertthat 0.2.1, backports 1.1.4, base64enc 0.1-3, bit 1.1-14, bit64 0.9-7, bitops 1.0-6, blob 1.2.0, broom 0.5.2, cellranger 1.1.0, checkmate 1.9.4, cli 1.1.0, cluster 2.1.0, colorspace 1.4-1, compiler 3.5.1, crayon 1.3.4, data.table 1.12.2, DBI 1.0.0, digest 0.6.20, evaluate 0.14, foreign 0.8-71, Formula 1.2-3, geneplotter 1.60.0, generics 0.0.2, GenomeInfoDbData 1.2.0, glue 1.3.1, grid 3.5.1, gridExtra 2.3, gtable 0.3.0, haven 2.1.1, Hmisc 4.2-0, hms 0.5.0, htmlTable 1.13.1, htmltools 0.3.6, htmlwidgets 1.3, httr 1.4.0, jsonlite 1.6, knitr 1.24, lattice 0.20-38, latticeExtra 0.6-28, lazyeval 0.2.2, locfit 1.5-9.1, lubridate 1.7.4, magrittr 1.5, Matrix 1.2-17, memoise 1.1.0, modelr 0.1.4, munsell 0.5.0, nlme 3.1-140, nnet 7.3-12, packrat 0.5.0, pillar 1.4.2, pkgconfig 2.0.2, R6 2.4.0, RColorBrewer 1.1-2, Rcpp 1.0.2, RCurl 1.95-4.12, rlang 0.4.0, rmarkdown 1.14, rpart 4.1-15, rsconnect 0.8.15, RSQLite 2.1.2, rstudioapi 0.10, rvest 0.3.4, scales 1.0.0, splines 3.5.1, stringi 1.4.3, survival 2.44-1.1, tidyselect 0.2.5, tools 3.5.1, vctrs 0.2.0, withr 2.1.2, xfun 0.8, XML 3.98-1.19, xml2 1.2.0, XVector 0.22.0, yaml 2.2.0, zeallot 0.1.0, zlibbioc 1.28.0

Parameter values used for this analysis are:

  • workDir: /Users/kfield/Documents/chickeneggs/SARTools.DESeq2.Kallisto.Transcripts.batchSex.byGroup
  • projectName: SARTools.DESeq2.Kallisto.Transcripts.batchSex.byGroup
  • author: Ken Field
  • targetFile: ../sex.eggs.target.txt
  • rawDir: ../kallisto/
  • featuresToRemove: NULL
  • varInt: Treatment
  • condRef: Control
  • batch: Sex
  • fitType: parametric
  • cooksCutoff: TRUE
  • independentFiltering: TRUE
  • alpha: 0.05
  • pAdjustMethod: BH
  • typeTrans: VST
  • locfunc: median
  • colors: dodgerblue, firebrick1, MediumVioletRed, SpringGreen

Bibliography

1. R Core Team. R: A language and environment for statistical computing. Vienna, Austria : R Foundation for Statistical Computing, 2017 :

2. Gentleman RC, Carey VJ, Bates DM et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004 ; 5: R80.

3. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology 2010 ; 11: R106.

4. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology 2014 ; 15: 550.

5. Schulze SK, Kanwar R, Gölzenleuchter M et al. SERE: Single-parameter quality control and sample comparison for rna-seq. BMC Genomics 2012 ; 13: 524.

6. Cook RD. Detection of influential observation in linear regression. Technometrics 1977 ; 19: 15–18.

7. Cox DR, Reid N. Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society. Series B (Methodological) 1987 ; 49: 1–39.

8. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 1995 ; 57: 289–300.